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Abstract 



Least squares approximation is a technique to find an approximate solution to a system 
of linear equations that has no exact solution. In a typical setting, one lets n be the number 
of constraints and d be the number of variables, with n 3> d. Then, existing exact methods 
find a solution vector in 0(nd 2 ) time. We present two randomized algorithms that provide 
very accurate relative-error approximations to the optimal value and the solution vector of a 

■ least squares approximation problem more rapidly than existing exact algorithms. Both of 
| our algorithms preprocess the data with the Randomized Hadamard Transform. One then 

uniformly randomly samples constraints and solves the smaller problem on those constraints, 
and the other performs a sparse random projection and solves the smaller problem on those 
CO ■ projected coordinates. In both cases, solving the smaller problem provides relative-error ap- 

proximations, and, if n is sufficiently larger than d, the approximate solution can be computed 

■ in 0{nd\ogd) time. 

1 Introduction 

' In many applications in mathematics and statistical data analysis, it is of interest to find an 

approximate solution to a system of linear equations that has no exact solution. For example, 
let a matrix A 6 M. nxd and a vector b 6 M n be given. If n 3> d, there will not in general exist a 
vector x 6 M. d such that Ax = b, and yet it is often of interest to find a vector x such that Ax « b 
in some precise sense. The method of least squares, whose original formulation is often credited 



to Gauss and Legendre [25], accomplishes this by minimizing the sum of squares of the elements 
of the residual vector, i.e., by solving the optimization problem 

Z = min \\Ax - b\L . (1) 

x£R d 

It is well-known that the minimum ^-norm vector among those satisfying eqn. (JTJ) is 

x opt = A^b, (2) 

where A^ denotes the Moore-Penrose generalized inverse of the matrix A [61 [16] . This solution 
vector has a very natural statistical interpretation as providing an optimal estimator among all 
linear unbiased estimators, and it has a very natural geometric interpretation as providing an 
orthogonal projection of the vector b onto the span of the columns of the matrix A. 
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Recall that to minimize the quantity in eqn. ([T]), we can set the derivative of \\Ax — 6|| | = 
(Ax — b) T (Ax — b) with respect to x equal to zero, from which it follows that the minimizing 
vector x op t is a solution of the so-called normal equations 

A T Ax opt = A T b. (3) 

Geometrically, this means that the residual vector b 1 - = b — Ax op t is required to be orthogonal to 

the column space of A, i.e., b ±T A = 0. While solving the normal equations squares the condition 
number of the input matrix (and thus is not recommended in practice) , direct methods (such as 
the QR decomposition |16j ) solve the problem of eqn. ([1]) in 0(nd 2 ) time assuming that n > d. 
Finally, an alternative expression for the vector x op t of eqn. ([2]) emerges by leveraging the Singular 
Value Decomposition (SVD) of A. If A = Ua^aVa denotes the SVD of A, then 

x pt = VAE^f/Jft. 

1.1 Our results 

In this paper, we describe two randomized algorithms that will provide very accurate relative- 
error approximations to the minimal i^-norm solution vector x op t of eqn. ([2]) faster than existing 
exact algorithms for a large class of overconstrained least-squares problems. In particular, we will 
prove the following theorem. 

Theorem 1 Suppose A E M. nxd , b G M n , and let e G (0,1). Then, there exists a randomized 
algorithm that returns a vector x op t £ M. d , such that, with probability at least 1/2, the following 
three claims hold: first, x op t satisfies 

\\Ax opt - b\\ 2 < (1 + e)Z; (4) 

second, if n(A) is the condition number of A and if we assume that 7 G [0, 1] is the fraction of 
the norm of b that lies in the column space of A (i.e., 7 = ||J7aJ7J6|| 2 / ||6|| 2 ; where Ua is an 
orthogonal basis for the column space of A ), then x op t satisfies 

\\x op t - x opt \\ 2 < x/i (k(A)^^- 2 - \\x opt \\ 2 ; (5) 

and third, the solution x op t can be computed in 0(ndlogd) time if n is sufficiently larger than d 
and less than e d . 

We will provide a precise statement of the running time for our two algorithms (including the 
e-dependence) in Theorems [2] (Section H]) and [3] (Section [5]), respectively. It is worth noting that 
the claims of Theorem [1] can be made to hold with probability 1 — 5, for any 5 > 0, by repeating 
the algorithm [log 2 (l/5)] times. Also, we will assume that n is a power of two and that the rank 
of the n x d matrix A equals d. (We note that padding A and b with all-zero rows suffices to 
remove the first assumption, whereas a simple modification of our analysis would also remove the 
second assumption.) 

Here, we provide a brief overview of our main algorithms. Let the matrix product HD denote 
the n x n Randomized Hadamard Transform (see also Section I2.4p . Here the n x n matrix H 
denotes the (normalized) matrix of the Hadamard transform and the n x n diagonal matrix D 
is formed by setting its diagonal entries to +1 or —1 with equal probability. This transform 
has been used as one step in the development of a "fast" version of the Johnson-Lindenstrauss 
lemma [H [18]. Our first algorithm is a random sampling algorithm. After premultiplying A 
and b by HD, this algorithm samples uniformly at random r constraints from the preprocessed 
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problem. (See eqn. (|22p . as well as the remarks after Theorem [2] for the precise value of r.) Then, 
this algorithm solves the least squares problem on just those sampled constraints to obtain a 
vector Xopt S M. d such that the three claims of Theorem [1] are satisfied. Note that applying the 
randomized Hadamard transform to the matrix A and vector b only takes 0(nd log r) time. This 
follows since we will actually sample only r of the constraints from the Hadamard-preprocessed 
problem [2]. Then, exactly solving the r x d sampled least-squares problem will require only 
0{rd 2 ) time. Assuming that e is a constant and n < e d , it follows that the running time of this 
algorithm is 0(nd\ogd) when = Q(d 2 ). 

In a similar manner, our second algorithm also initially premultiplies A and b by HD. This 
algorithm then multiplies the result by a kxn sparse projection matrix T, where k = 0{d/e). This 
matrix T is described in detail in Section [5. 2i Its construction depends on a sparsity parameter, 
and it is identical to the "sparse projection" matrix in Matousek's version of the Ailon-Chazelle 
result [H [18]. Finally, our second algorithm solves the least squares problem on just those k 
coordinates to obtain x op t £ M. d such that the three claims of Theorem [1] are satisfied. Assuming 
that e is a constant and n < e d , it follows that the running time of this algorithm is 0(nd log d) 
when n = fl(d 2 ). 

It is worth noting that our second algorithm has a (marginally) less restrictive assumption on 
the connection between n and d. However, the first algorithm is simpler to implement and easier 
to describe. Clearly, an interesting open problem is to relax the above constraints on n for either 
of the proposed algorithms. 

1.2 Related work 

We should note several lines of related work. 

• First, techniques such as the "method of averages" |10j preprocess the input into the form 
of eqn. ([6]) of Section [3] and can be used to obtain exact or approximate solutions to the 
least squares problem of eqn. (TjTJ in o(nd 2 ) time under strong statistical assumptions on A 
and b. To the best of our knowledge, however, the two algorithms we present and analyze 
are the first algorithms to provide nontrivial approximation guarantees for overconstrained 
least squares approximation problems in o(nd 2 ) time, while making no assumptions at all 
on the input data. 

• Second, Ibarra, Moran, and Hui |17j provide a reduction of the least squares approximation 
problem to the matrix multiplication problem. In particular, they show that MM(d)0(n/d) 
time, where MM(d) is the time needed to multiply two d x d matrices, is sufficient to 
solve this problem. All of the running times we report in this paper assume the use of 
standard matrix multiplication algorithms, since o(d 3 ) matrix multiplication algorithms are 
almost never used in practice. Moreover, even with the current best value for the matrix 
multiplication exponent, uj ~ 2.376 [9], our algorithms are still faster. 

• Third, motivated by our preliminary results as reported in |12j and [23], both Rokhlin and 
Tygert [21] as well as Avron, Maymounkov, and Toledo [U [5] have empirically evaluated 
numerical implementations of variants of one of the algorithms we introduce. We describe 
this in more detail below in Section [1.31 

• Fourth, very recently, Clarkson and Woodruff proved space lower bounds on related prob- 
lems [8]; and Nguyen, Do, and Tran achieved a small improvement in the sampling com- 
plexity for related problems [20J. 
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1.3 Empirical performance of our randomized algorithms 

Although we have empirically evaluated randomized algorithms that rely on the algorithms we in- 
troduce in this paper in several large-scale data analysis tasks, it is a fair question to ask whether 
our "random perspective" on linear algebra will work well in high-quality numerical implemen- 
tations of interest in scientific computation. We address that question here. Although we do not 
provide such an empirical evaluation in this paper, in the wake of the original Technical Report 
version of this paper in 2007 |14| . two groups of researchers have unambiguously demonstrated 
that high-quality numerical implementations of variants of the algorithms we introduce in this 
paper can perform extremely well in practice. 

• In 2008, Rokhlin and Tygert [2T] describe a variant of our random projection algorithm, 
and they demonstrate that their algorithm runs in time 

O (log(£) + k log ( 1 /e)nd + d 2 £), 

where £ is an "oversampling" parameter and k is a condition number. Importantly (at least 
for very high-precision applications of this random sampling methodology) , they reduce the 
dependence on e from 1/e to log(l/e). Moreover, by choosing £ > Ad 2 , they demonstrate 
that k < 3. Although this bound is inferior to ours, they also consider a class of matrices 
for which choosing £ = Ad empirically produced a condition number k < 3, which means 
that for this class of matrices their running time is 

0(log(d) + «log(l/e)nd + d 3 ). 

Their numerical experiments on this class of matrices clearly indicate that their high-quality 
numerical implementations of variants of our algorithms perform well for certain matrices 
as small as thousands of rows by hundreds of columns. 

• In 2009, Avron, Maymounkov, Toledo [U E] introduced a randomized least-squares solver 
based directly on our algorithms. They call it Blendenpik, and by considering a much 
broader class of matrices, they demonstrate that their solver "beats LAPACK's direct 
dense least-sqares solver by a large margin on essentially any dense tall matrix." Be- 
yond providing additional theoretical analysis, including backward error analysis bounds 
for our algorithm, they consider five (and numerically implement three) random projection 
strategies (i.e., Discrete Fourier Transform, Discrete Cosine Transform, Discrete Hartely 
Transform, Walsh-Hadamard Transform, and a Kac random walk), and they evaluate their 
algorithms on a wide range of matrices of various sizes and various "localization " or "co- 
herence" properties. Based on these results that empirically show the superior performance 
of randomized algorithms such as those we introduce and analyze in this paper on a wide 
class of matrices, they go so far as to "suggest that random-projection algorithms should 
be incorporated into future versions of LAPACK." 

1.4 Outline 

After a brief review of relevant background in Section [21 Section [3] presents a structural result out- 
lining conditions on preconditioner matrices that are sufficient for relative-error approximation. 
Then, we present our main sampling-based algorithm for approximating least squares approxi- 
mation in Section U] and in Section [5] we present a second projection-based algorithm for the same 
problem. Preliminary versions of parts of this paper have appeared as conference proceedings in 
the 17th ACM-SIAM Symposium on Discrete Algorithms [12J and in the 47th IEEE Symposium 
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on Foundations of Computer Science [23]; and the original Technical Report version of this jour- 
nal paper has appeared on the arXiv p3]. In particular, the core of our analysis in this paper was 
introduced in |12j . where an expensive-to-compute probability distribution was used to construct 
a relative-error approximation sampling algorithm for the least squares approximation problem. 
Then, after the development of the Fast Johnson-Lindenstrauss transform [T], [23] proved that 
similar ideas could be used to improve the running time of randomized algorithms for the least 
squares approximation problem. In this paper, we have combined these ideas, treated the two 
algorithms in a manner to highlight their similarities and differences, and considerably simplified 
the analysis. 



2 Preliminaries 
2.1 Notation 

We let [n] denote the set {1,2,..., n}; log x denotes the natural logarithm of x and log 2 x denotes 
the base two logarithm of x. For any matrix A G M nxc( , A^,i G [n] denotes the i-th row of A 
as a row vector and A^\j G [d] denotes the j-th column of A as a column vector. Also, given a 
random variable X, we let E [X] denote its expectation and Var [X] denote its variance. 
We will make frequent use of matrix and vector norms. More specifically, we let 

n d 

\\ a \\f = X/X, ! 
t=i j=i 

denote the square of the Frobenius norm of A, and we let 

||A|| 2 = sup || 2 

x£R d , ||x|| 2 =l 

denote the spectral norm of A. For any vector x G M. n , its ^-norm (or Euclidean norm) is equal 
to the square root of the sum of the squares of its elements, while its norm is defined as 

Halloo = max ie[n] \ x i\- 



2.2 Linear Algebra background 

We now review relevant definitions and facts from linear algebra; for more details, see [24 1 116 1 171 IB], 
Let the rank of A G W ixd be p < min{n, d}. The Singular Value Decomposition (SVD) of A is 
denoted by A = C/^S^VT, where XJ a £ M nxp is the matrix of left singular vectors, S 



9X p 



is the diagonal matrix of non-zero singular values, and Va £ M. dxp is the matrix of right singular 
vectors. Let Oi{A),i G [p], denote the i-th non-zero singulcir value of -A, and <T max (A) and 
f"min(^4) denote the maximum and minimum singular value of A. The condition number of A is 
k(A) = cr max (A) I '<7 m i n (j4) . The Moore-Penrose generalized inverse, or pseudoinverse, of A may 
be expressed in terms of the SVD as A^ = Va^a^^a El- Finally, for any orthogonal matrix 
U G M. nxi , let U 1 - G R nx ( n ^) denote an orthogonal matrix whose columns are an orthonormal 
basis spanning the subspace of M. n that is orthogonal to the column space of U. In terms of Uj[, 
the optimal value of the least squares approximation problem of eqn. (pQ) is 



min \\Ax — 6IL 
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2.3 Markov's inequality and the union bound 

We will make frequent use of the following fundamental result from probability theory, known 
as Markov's inequality |19j . Let X be a random variable assuming non-negative values with 
expectation E [X]. Then, for all t > 0, 

X < t ■ E [X] 

with probability at least 1 — t . 

We will also need the so-called union bound. Given a set of probabilistic events £±, £2, ■ ■ ■ , £ n 
holding with respective probabilities pi,P2, ■ ■ ■ ,Pn, the probability that all events hold (a.k.a., the 
probability of the union of those events) is upper bounded by ^7=1 Pi- 



2.4 The Randomized Hadamard Transform 



The Randomized Hadamard Transform was introduced in pQ as one step in the development of a 
fast version of the Johnson-Lindenstrauss lemma [TJ [18] . Recall that the (non-normalized) n x n 
matrix of the Hadamard transform H n may be defined recursively as follows: 



H„ 



H 
H 



n/2 
n/2 



H 



n/2 



-H, 



n/2 



with 



Ho 



+1 +1 
+1 -1 



The n x n normalized matrix of the Hadamard transform is equal to -^H n ; hereafter, we will 
denote this normalized matrix by H. Now consider a diagonal matrix D £ W nxn such that 
Da is set to +1 with probability 1/2 and to —1 with probability 1/2. The product HD is the 
Randomized Hadamard Transform and has two useful properties. First, when applied to a vector, 
it "spreads out" its energy, in the sense of providing a bound for its infinity norm (see Section H2]). 
Second, computing the product HDx for any vector x G M. n takes 0(nlog 2 n) time. Even better, 
if we only need to access, say, r elements in the transformed vector, then those r elements can 
be computed in 0(n log 2 r) time [2]. We will expand on the latter observation in the proofs of 
Theorems [2] and El 



3 Our algorithms as preconditioners 

Both of our algorithms may be viewed as preconditioning the input matrix A and the target vector 
b with a carefully-constructed data-independent random matrix X. For our random sampling 
algorithm, we let X = S T HD, where S is a matrix that represents the sampling operation and 
HD is the Randomized Hadamard Transform, while for our random projection algorithm, we 
let X = THD, where T is a random projection matrix. Thus, we replace the least squares 
approximation problem of eqn. ([TJ with the least squares approximation problem 

Z= min \\X(Ax-b)\\o. (6) 

We explicitly compute the solution to the above problem using a traditional deterministic algo- 
rithm |16| . e.g., by computing the generalized inverse 

x opt = {XA) ] Xb. (7) 

Alternatively, one could use standard iterative methods such as the the Conjugate Gradient 
Normal Residual method (CGNR, see [16] for details), which can produce an e-approximation 
to the optimal solution of eqn. ([6]) in 0(n(XA)rd log(l/e)) time, where n(XA) is the condition 
number of XA and r is the number of rows of XA. 
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3.1 A structural result sufficient for relative-error approximation 



In this subsection, we will state and prove a lemma that establishes sufficient conditions on any 
matrix X such that the solution vector x op t to the least squares problem of eqn. © will satisfy 
relative-error bounds of the form © and ©. Recall that the SVD of A is A = U A T, A Vj. In 

. . . T 

addition, for notational simplicity, we let b = 
vector b lying outside of the column space of A. 

The two conditions that we will require of the matrix X are: 



U a Ua b denote the part of the right hand side 



v m in {XU A ) > 1/V2; and 



U\X T Xb A 



< eZ 2 /2, 



(8) 
(9) 



for some e G (0, 1). Several things should be noted about these conditions. First, although 
condition fl5J) depends on the right hand side vector b, Algorithms [T] and [2] will satisfy it without 
using any information from b. Second, although condition flSJ) only states that af(XU A ) > 1/V2, 
for all i £ [d], for both of our randomized algorithms we will show that |l -af(XU A )\ < 1-2" 1 / 2 , 
for all i G [d]. Thus, one should think of XU A as an approximate isometry. Third, condition Q 
simply states that Xb 1 - = XU^U^ T b remains approximately orthogonal to XU A . Finally, note 
that the following lemma is a deterministic statement, since it makes no explicit reference to 
either of our randomized algorithms. Failure probabilities will enter later when we show that our 
randomized algorithms satisfy conditions (jSJ) and @. 

Lemma 1 Consider the overconstrained least squares approximation problem of eqn. (CP and let 
the matrix U A G W nxd contain the top d left singular vectors of A. Assume that the matrix X 
satisfies conditions JSp and (0|) above, for some e G (0, 1). Then, the solution vector x op t to the 
least squares approximation problem fifty satisfies: 



\Axgpt — b\\ 2 < (l + e)Z, and 



| X pt X pt 1 1 2 — 



1 



■y/eZ. 



0~min (■A) 

Proof: Let us first rewrite the down-scaled regression problem induced by X as 



min - Xy4x||2 



mm 



mm 



X(Ax op t + 6 ) - XA(x op t + y) 

2 
2 



Xb A 



XAy 



mm 



Xb ± - XU A z 



(10) 
(11) 



(12) 



(13) 



(|12p follows since b = Ax op t + b 1 - and (|13p follows since the columns of the matrix A span the 
same subspace as the columns of U A . Now, let z op t G M d be such that U A z apt = A(xopt — Xopt), 
and note that z op t minimizes eqn. (|13|) . The latter fact follows since 



Xb L - XA(x 01 



opt 



J opt , 



Xb L - X{b - b 1 ) + XAxopt 



\XAx, 



opt 



Xb\\l 



Thus, by the normal equations ([3]), we have that 

(XU A ) T XU A z op t = {XU A ) T Xb A 
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Taking the norm of both sides and observing that under condition (J5|) we have ai((XU A ) T XU A ) 
af(XU A ) > l/y/2, for all i, it follows that 



2 



2 



IIMI2/2 < || (XUa) 1 XU A z opt \\ 2 = (XU A ) 1 Xb ± 2 . (14) 
Using condition ([9]) we observe that 

\\zopt\\l < eZ 2 . (15) 
To establish the first claim of the lemma, let us rewrite the norm of the residual vector as 

|| 6 ^4x p^ 1 1 2 — ||^ Ax opt -\- Ax pt Ax pt\\2 

= \\b ^4x D pt||2 + 11^4.3'opt ^4^opf 1 1 2 (16) 

= Z 2 + \\UAZopt\\l (17) 

< Z 2 + eZ 2 , (18) 

where (|16l) follows by Pythagoras, since b—Axopt = b^, which is orthogonal to A, and consequently 
to A(x op t — x op t); (fTTj) follows by the definition of z op t and Z; and (fTBj) follows by (fl~5j) and the 
orthogonality of U A . The first claim of the lemma follows since y/1 + e < 1 + e. 

To establish the second claim of the lemma, recall that A{x op t — x op t) = U A z op t- If we take 
the norm of both sides of this expression, we have that 

1 1 X opt X opt H2 _ 2 ( a\ 

(19) 

< ^ (20) 

where (|19p follows since a m i n (A) is the smallest singular value of A and since the rank of A is d0 
and (f20j) follows by (fT5j) and the orthogonality of U A . Taking the square root, the second claim 
of the lemma follows. 

o 

If we make no assumption on 6, then fill j) from Lemma Q] may provide a weak bound in terms of 
II x opt || n- Hi 011 the other hand, we make the additional assumption that a constant fraction of 
the norm of b lies in the subspace spanned by the columns of A, then (11 If) can be strengthened. 
Such an assumption is reasonable, since most least-squares problems are practically interesting if 
at least some part of b lies in the subspace spanned by the columns of A. 

Lemma 2 Using the notation of LemmaUl and assuming that 1 1 C/4 1/^" ^ 1 1 2 — 7 ll^lb; f or some 
fixed 7 E (0, 1] it follows that 



x pt - Xopt || 2 < \/e (k(A)V7 2 -l) ||^o P t|| 2 - ( 21 



Proof: Since ||^A^1&|| 2 > 7 ||fr|| 2 , it follows that 

Z 2 = \\bf 2 - \\U A U T A bt 



2V'2 



< (7- -1) \\U A UXb 



< cr max(^ 4 )(7 2 - 1) Ikoptlla 



x As x op i — a; p t lies in the row space of A, inequality (fl9)> still holds if rank(yl) < d. 
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This last inequality follows from V ' j^JJ^b = Axopt, which implies 

|| ^A^A^H 2 = II °P* II2 — II 2 II x opt II 2 = °"max {A) \\X0ptW2 ' 
By combining this with eqn. (jlip of Lemma [TJ the lemma follows. 

o 

4 A sampling-based randomized algorithm 

In this section, we present our randomized sampling algorithm for the least squares approximation 
problem of eqn. (pQ). We also state and prove an associated quality-of-approximation theorem. 

4.1 The main algorithm and main theorem 

Algorithm [1] takes as input a matrix A G M. nxd , a vector b G M n , and an error parameter e G (0, 1). 
This algorithm starts by preprocessing the matrix A and the vector b with the Randomized 
Hadamard Transform. It then constructs a smaller problem by sampling uniformly at random a 
small number of constraints from the preprocessed problem. Our main quality-of-approximation 
theorem (Theorem [2] below) states that with constant probability over the random choices made 
by the algorithm, the vector x op t returned by this algorithm will satisfy the relative-error bounds 
of eqns. @ and ([5]) and will be computed fast. 



Input: A G M nx ' i , b G R n , and an error parameter e G (0, 1). 
Output: x opt G 

1. Let r assume the value of eqn. (I22p . 

2. Let S be an empty matrix. 

3. For t = 1, . . . ,r (i.i.d. trials with replacement) select uniformly at random an 

integer from {1,2,..., n}. 

• If % is selected, then append the column vector {^s/njrj e% to S, where ej G W 1 
is an all-zeros vector except for its i-th entry which is set to one. 

4. Let H G M nxn be the normalized Hadamard transform matrix. 

5. Let D G M"^™ be a diagonal matrix with 

( +1 , with probability 1/2 
\ — 1 , with probability 1/2 

6. Compute and return x op t = (S T HDA^ S 7 HDb. 



Algorithm 1: A fast random sampling algorithm for least squares approximation 

In more detail, after preprocessing with the Randomized Hadamard Transform of Section \2. 41 
Algorithm [T] samples exactly r constraints from the preprocessed least squares problem, rescales 
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each sampled constraint by y/n/r, and solves the least squares problem induced on just those 
sampled and rescaled constraints. (Note that the algorithm explicitly computes only those rows 
of HDA and only those elements of HDb that need to be accessed.) More formally, we will let 
S G M. nxr denote a sampling matrix specifying which of the n constraints are to be sampled and 
how they are to be rescaled. This matrix is initially empty and is constructed as described in 
Algorithm [TJ Then, we can consider the problem 

Z = min \\S T HDAx - S T HDbL , 

which is just a least squares approximation problem involving the r constraints sampled from 
the matrix A after the preprocessing with the Randomized Hadamard Transform. The minimum 
^2-norm vector x op t G M rf among those that achieve the minimum value Z in this problem is 

x opt = {S T HDA) ] S T HDb, 
which is the output of Algorithm [TJ 

Theorem 2 Suppose A G R nxd , b G M. n , and let e G (0, 1). Run Algorithm^ witM 

r = max j(35c ) 2 (ilog(40n(i)log ((25c ) 2 dlog (40nd)) , 16dlog(40nd)/e} (22) 

and return x op t- Then, with probability at least .5, the following three claims hold: first, x op t 
satisfies 

\\Ax opt -b\\ 2 < (l + e)Z; 
second, if we assume that \\UAU^b\\ 2 > 7 ||6|| 2 for some 7 G (0, 1], then x op t satisfies 

\\x op t ~ x opt \\ 2 < \fe (k(A)vV~ 2 - l) \\xopt\\ 2 ; 

and third 

n(d + 1) + 2n{d + 1) log 2 (r + 1) + O (rd 2 ) 
time suffices to compute the solution x op t- 

Remark: Assuming that d < n < e , and using max{ai, 02} < a\ + a%, we get that 

d log n s 



r = |^(i(logd)(logn) + 

Thus, the running time of Algorith [JJ becomes 

^ ( 1 , d n M . d 3 log n 
O [nd log - + cP (log d) (log n) + 



e e 
Finally, assuming that j^jj = n(d 2 ), the ab ove running time reduces to 



d ndlogd, 
Uyndiog — h 



e e 



It is worth noting that improvements over the standard 0(nd 2 ) time could be derived with weaker 
assumptions on n and d; however, for the sake of clarity of presentation, we only focus on the 
above setting. 



2 Co is an unspecified constant in Theorem 3.1 of 
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Remark: The assumptions in our theorem have a natural geometric interpretation!! In par- 
ticular, they imply that our approximation becomes worse as the angle between the vector 
b and the column space of A increases. To see this, let Z = \\Ax op t — b\\2, and note that 
IHIi = II^A^J^Hl + Z 2 - Hence the assumption ||£/a£^J&||2 > 7IHI2 can be simply stated as 



Z < yl -7 2 ||0|| 2 . 

The fraction iJ/||&|| 2 is the sine of the angle between b and the column space of A; see page 242 
of [E]. Thus, \/^f~ 2 — 1 is a bound on the tangent between b and the column space of A; see 
page 244 of [16]. This means that the bound for \\x op t — x D pt||2 is proportional to this tangent. 

4.2 The effect of the Randomized Hadamard Transform 

In this subsection, we state a lemma that quantifies the manner in which HD approximately 
"uniformizes" information in the left singular subspace of the matrix A. We state the lemma for 
a general n x d orthogonal matrix U such that U T U = Id, although we will be interested in the 
case when n^> d and U consists of the top d left singular vectors of the matrix A. 

Lemma 3 Let U be an nx d orthogonal matrix and let the product HD be the nxn Randomized 
Hadamard Transform of Section\2.4\ Then, with probability at least 0.95, 



(HDU) 







< 



2(flog(40ncf) 



n 



for all i £ [n]. 



(23) 



Proof: We follow the proof of Lemma 2.1 in [TJ. In that lemma, the authors essentially prove that 
the Randomized Hadamard Transform HD "spreads out" input vectors. More specifically, since 
the columns of the matrix U (denoted by for all j G [d]) are unit vectors, they prove that 
for fixed j E [d] and fixed i£ [n], 



Pr 



HDU {j) 



> s 



< 2e~ s2n/2 . 



(Note that we consider d vectors in M. n whereas [T] considered n vectors in M. d and thus the roles 
of n and d are inverted in our proof.) Let s = \j2n~ 1 log(40racf) to get 



Pr 



< 



1 



20nd' 



HDU {j) j > ^J2n- 1 log(40nd) 
From a standard union bound, this immediately implies that with probability at least 1 — 1/20, 

( HDU U) ) 



< y/2n- 1 log(40nd) 



(24) 



holds for all i G [n] and j £ [d] . Using 



2 2dlog(40nd) 



i=i 



for all i € [n], we conclude the proof of the lemma. 



■n 



(25) 



We would like to thank Use Ipsen for pointing out to us this geometric interpretation. 
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4.3 Satisfying condition 

We now establish the following lemma which states that all the singular values of SHDUa are 
close to one. The proof of Lemma U] depends on a bound for approximating the product of a 
matrix times its transpose by sampling (and rescaling) a small number of columns of the matrix. 
This bound appears as Theorem [1] in the Appendix and is an improvement over prior work of 
ours in 



Lemma 4 Assume that Lemma{3\ holds. Ifr assumes the value of eqn. t29\) then, with probability 
at least .8, 

\l-c-f{S T HDU A )\<l--L, 

holds for all i G [d] . 

Proof: Note that for all i € [d] 

1 1 — o~f (S T HDU A ) | = \<Ti (U%DH T HDU A ) - Oi (U%DH T SS T HDU A ) \ 

< \\U%DH T HDU A - U%DH T SS T HDU A \\ 2 . (26) 

In the above, we used the fact that U\DH T HDU A = I d . We now can view U\DSS T H T HDU A 
as an approximation to the product of two matrices U^DH T = (HDU A ) T and HDU A by ran- 
domly sampling and rescaling columns of (HDU A ) T . We can now leverage Theorem H] from the 
Appendix. More specifically, consider the matrix (HDU A ) T . Obviously, since H, D, and U A are 
orthogonal matrices, \\HDU A \\ 2 = 1 and \\HDU A \\ F = \\U A \\ F = Vd. Let /3 = (2 log(40nd)) _1 ; 
using Lemma[3]we note that the columns of (H DU A ) T , which correspond to the rows of HDU A , 
satisfy 

(HDU A ) 



n \\HDUa\\f 



for all i € [n]. (27) 



\l 

Thus, applying Theorem 2] with e = 58/1000 implies that 

E [ \\U^DH T HU A - U^DH T SS T HDU A \\ 2 ] < 58/1000. (28) 
For the above bound to hold, we need to set r to be at least 

r > (35c ) 2 dlog (40nd) log ( (25c ) 2 dlog (40nd)) . (29) 



It is worth noting that an assumption of Theorem [J] necessitates that dc\ > 4/3e 2 ; this is trivially 
satisfied as long as cq is at least 1/20. From Markov's inequality, we get that with probability at 
least l-(l/5) 

\\U%DH T HU A ~ UjDH T SS T HDU A \\ 2 < 5(58/1000) = 58/200 < 1 - -L, 

V2 



which, combined with inequality (J26J), concludes the proof of the lemma. 
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4.4 Satisfying condition ([9]) 



We next prove the following lemma, from which it will follow that condition §§§ is satisfied by 
Algorithm [TJ The proof of this lemma depends on bounds for randomized matrix multiplication 
algorithms that appeared in 

Lemma 5 If Lemma\3\ holds and r = 16dlog(40nd)/e, then with probability at least .75, 



'S T HDU A ) T S T HDb ± <eZ 2 /2 



Proof: Recall that b x = U±U± T b and that Z 



| XJ\DH T HDb 1 - 1 1 2 



|J7J6-L||^ = it follows that 



We start by noting that since 



'S T HDU A ) T S T HDb 1 



UlDH T SS T HDb ± - UlDH T HDb A 



Thus, we can view (S T HDUa) T S T HDb ± as approximating the product of two matrices (HDUa) T 
and HDb 1 - by randomly sampling columns from (HDUa) T and rows/elements from HDb 1 -. Note 
that the sampling probabilities are uniform and do not depend on, e.g, the norms of the columns 
of {HDU A f or the rows of rlb^. However, we can still apply the results of Table 1 (second row) 
in page 150 of [llj. More specifically, since we condition on Lemma [3] holding, the rows of HDUa 
(which of course correspond to columns of (HDUa) T ) satisfy 



n 



(HDU A 



\\hdua\\ 2 f 



for all i 6 [n], 



for fi = (21og(40nd)) . Applying the result of Table 1 (second row) of |llj we get 

2 dZ 2 



E 



(S T HDU A ) S T HDb A 



<^\\HDU A \\ 2 F 

pr 



HDb 



fir 



(30) 



In the above we used H-ff-Dt/^lll, = d. Markov's inequality now implies that with probability at 
least .75, 



(S T HDU A ) S T HDb A 



< 



fir 



Setting r 



d/e and using the value of fi specified above concludes the proof of the lemma. 



4.5 Completing the proof of Theorem [2] 

In this subsection we complete the proof of Theorem [2j First, note that, from a standard union 
bound, Lemmas El 131 and [5] all hold with probability at least .5. Combining these lemmas with 
the structural results of Lemma[T]and setting r as in eqn. (j22|) concludes the proof of the accuracy 
guarantees of Theorem [2l 

We now discuss the running time of Algorithm [TJ First of all, by the construction of S, the 
number of non-zero entries in S is r. In Step 6 we need to compute the products S T HDA and 
S T HDb. Recall that A has d columns and thus the running time of computing both products is 
equal to the time needed to apply S T HD on (d + 1) vectors. First, note that in order to apply D 
on (d + 1) vectors in W 1 , n(d + 1) operations suffice. In order to estimate how many operations 
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are needed to apply S T H on (d + 1) vectors, we use the results of Theorem 2.1 (see also Section 
7) of Ailon and Liberty [2], which state that at most 2n(<i+l) log 2 (l^l + 1) operations are needed 
for this operation. Here \S\ denotes the number of non-zero elements in the matrix S, which is at 
most r. After this preprocessing, Algorithm Q] must compute the pseudoinverse of an r x d matrix, 
or, equivalently, solve a least-squares problem on r constraints and d variables. This operation 
requires 0(rd 2 ) time since r > d. Thus, the entire algorithm runs in time 

n(d + 1) + 2n{d + 1) log 2 (r + 1) + O (rd 2 ) . 



5 A projection-based randomized algorithm 

In this section, we present a projection-based randomized algorithm for the least squares approx- 
imation problem of eqn. ([T]). We also state and prove an associated quality-of-approximation 
theorem. 



5.1 The main algorithm and main theorem 

Algorithm [2] takes as input a matrix A G M nxd , a vector b G R n , and an error parameter 
e G (0, 1/2). This algorithm also starts by preprocessing the matrix A and right hand side vector b 
with the Randomized Hadamard Transform. It then constructs a smaller problem by performing 
a "sparse projection" on the preprocessed problem. Our main quality-of-approximation theorem 
(Theorem [3] below) will state that with constant probability (over the random choices made by 
the algorithm) the vector x op t returned by this algorithm will satisfy the relative-error bounds of 
eqns. (jU and ([5]) and will be computed fast. 

In more detail, Algorithm [2] begins by preprocessing the matrix A and right hand side vector 
b with the Randomized Hadamard Transform HD of Section 12.41 This algorithm explicitly 
computes only those rows of HDA and those elements of HDb that need to be accessed to 
perform the sparse projection. After this initial preprocessing, Algorithm [2] will perform a "sparse 
projection" by multiplying HDA and HDb by the sparse matrix T (described in more detail in 
Section f 5 . 2 [) . Then, we can consider the problem 

Z = min \\THDAx - THDb\\ 2 , 

which is just a least squares approximation problem involving the matrix THDA G R kxd and 
the vector THDb G R k . The minimum ^-norm vector x op t G R d among those that achieve the 
minimum value Z in this problem is 

x^t = (THDA) j THDb, 

which is the output of Algorithm [2l 

Theorem 3 Suppose A G R nxd , b G R n , and let e G (0, 1/2). Run Algorithm^ wit^ 

q > C ^ l0g(4 ° nd) (21ogn + 16d+13) (31) 
n 

k > max jc fc (ll8 2 d + 85 2 ) (32) 



4 



C q and Ck are the unspecified constants of Lemma [6] 
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Input: A G W axd , b G W n , and an error parameter e G (0, 1/2). 
Output: x opt G M d . 

1. Let q and assume the values of eqns. (|3ip and (I32p . 

2. Let T G M fcx ™ be a random matrix with 

+yf^ , with probability q/2 
Tij = ' -^~L ( with probability g/2 
, with probability 1 — 

for all i,j independently. 

3. Let H G M nxn be the normalized Hadamard transform matrix. 

4. Let D G M"^™ be a diagonal matrix with 



D; 



5. Compute and return x D pt — 

(THDAy THDb 



+1 , with probability 1/2 
— 1 , with probability 1/2 



Algorithm 2: A fast random projection algorithm for least squares approximation 

and return x op t- Then, with probability at least .5, the following three claims hold: first, x op t 
satisfies 

\\Ax op t-b\\ 2 < (l + e)Z; 
second, if we assume that \\UAU^b\\ 2 > 7 ||6|| 2 for some 7 G (0, 1] then x op t satisfies 

\\x op t ~ x opt \\ 2 < \fe (k(A)\/j- 2 - lj ||x p t || 2 ; 

and, third, the expected running time of the algorithm is (at most) 

n(d + 1) + 2n(d + 1) log 2 (nkq + 1) + O (kd 2 ) . 
Remark: Assuming that d < n < e d we get that 

o = o{^^) and k = (± 

Thus, the expected running time of Algorithm [2] becomes 

n / d d 3 \ 
O I ndlog- H I . 

Finally, assuming n = n(d 2 ), the ab ove running time reduces to 

. „ d nd 
U I nd log — I 

e e 
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It is worth noting that improvements over the standard 0(nd 2 ) time could be derived with weaker 
assumptions on n and d; however, for the sake of clarity of presentation, we only focus on the 
above setting. 

5.2 Sparse projection matrices 

In this subsection, we state a lemma about the action of a sparse random matrix operating on a 
vector. Recall that given any set of n points in Euclidean space, the Johnson-Lindenstrauss lemma 
states that those points can be mapped via a linear function to k = 0(e~ 2 logn) dimensions such 
that the distances between all pairs of points are preserved to within a multiplicative factor of 
l±e; see [18j and references therein for details. 

Formally, let e 6 (0,1/2) be an error parameter, 5 £ (0,1) be a failure probability, and 
a £ [1/y/n, 1] be a "uniformity" parameter. In addition, let q be a "sparsity" parameter defining 
the expected number of nonzero elements per row, and let k be the number of rows in our matrix. 
Then, define the kxn random matrix T as in Algorithm [2J Matousek proved the following lemma, 
as the key step in his version of the Ailon-Chazelle result [T| 118]. 

Lemma 6 Let T be the sparse random matrix of Algorithm^ where q = C q a 2 log(^) for some 
sufficiently large constant C q (but still such that q < 1), and k = Cfce~ 2 log(|j for some suffi- 
ciently large constant (but such that k is integral). Then for every vector x 6 W 1 such that 
\\x\\ I ||x|| 2 < at, we have that with probability at least 1 — 5 



Remark: In order to achieve sufficient concentration for all vectors x £ M. n , the linear mapping 
defining the Johnson-Lindenstrauss transform is typically "dense," in the sense that almost all 
the elements in each of the k rows of the matrix defining the mapping are nonzero. In this case, 
implementing the mapping on d vectors (in, e.g., a matrix A) via a matrix multiplication requires 
0(ndk) time. This is not faster than the 0(nd 2 ) time required to compute an exact solution 
to the problem of eqn. ([1]) if k is at least d. The Ailon-Chazelle result [H [18] states that the 
mapping can be "sparse," in the sense that only a few of the elements in each of the k rows need 
to be nonzero, provided that the vector x is "well-spread," in the sense that H^H^ / ||x|| 2 is close 
to 1/y/n. This is exactly what the preprocessing with the Randomized Hadamard Transform 
guarantees. 

5.3 Proof of Theorem H 

In this subsection, we provide a proof of Theorem [3l Recall that by the results of Section 13.11 
in order to prove Theorem [3j we must show that the matrix THD constructed by Algorithm [2] 
satisfies conditions (JS|) and ([9]) with probability at least .5. The next two subsections focus on 
proving that these conditions hold; the last subsection discusses the running time of Algorithm [2j 

5.4 Satisfying condition ([8]) 

In order to prove that all the singular values of THDUa are close to one, we start with the 
following lemma which provides a means to bound the spectral norm of a matrix. This lemma is 
an instantiation of lemmas that appeared in [3"1 1 15 j . 

Lemma 7 Let M be a d x d symmetric matrix and define the grid 



'J- ' X 1 1 2 || X 1 1 2 1 ^ 11*^* 




(33) 
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In words, SI includes all d- dimensional vectors x whose coordinates are integer multiples of 
2\fd\ and satisfy \\x\\ 2 < 1. Then, the cardinality of SI is at most e 4rf . In addition, if for 



every x,y 6 SI we have that \x My\ < e' , then for every unit vector x we have that \x Mx\ < 4e'. 

We next establish Lemma [HI which states that all the singular values of THDUa are close to one 
with constant probability. The proof of this lemma depends on the bound provided by Lemma [7] 
and it immediately shows that condition ([5]) is satisfied by Algorithm [2J 

Lemma 8 Assume that Lemma holds. If q and k assume the values of eqns. (31) and \38\) 
respectively, then, with probability at least .8, 

|1 - a? (THDUa) I <1-(1/V2) 

holds for all i € [d]. 

Proof: Define the symmetric matrix M = U^DH t T t THDUa - Id € R dxd , recall that I d = 
U^DH T HDU A , and note that 

{THDU A )\ < \\M\\ 2 (34) 

holds for all i € [d]. Consider the grid SI of eqn. (J33J and note that there are no more than e sd 
pairs (x,y) € SI x SI, since < e Ad by Lemma[71 Since ||M|| 2 = sup|| a .j| 2=1 |x T Mx|, in order to 
show that ||M|| 2 < 1 - 2- 1 / 2 , it suffices by Lemma[7]to show that \x T My\ < (l - 2" 1 / 2 ) /4, for 
all x,y G SI. To do so, first, consider a single x,y pair. Let 

Ax = \\THDU A {x + y)\\ 2 2 - \\HDU A {x + y)f 2 
A 2 = \\THDU A x\\ 2 2 - \\HDU A x\\l 
A 3 = \\THDU A y\\l - \\HDUAy\\l, 

and note that 

Ai = (x + y) T UlDH T T T THDU A (x + y) - (x + yf(x + y). 

By multiplying out the right hand side of the above equation and rearranging terms, it follows 
that 

x T My = x T UlDH T T T THDU A y -x T y = ^ (Ai + A 2 + A 3 ) . (35) 

In order to use Lemma [6] to bound the quantities Ai,A2, and A3, we need a bound on the 
uniformity ratio WHDUaxW^ / \\HDUax\\ 2 - To do so, note that 



\HDUaxWm maX i6[n] 



(HDU A ) {i) x 



\\HDU A x\\ 2 \\HDU A x\\ 2 



max ie [ n] 
< 



(HDUa 



(0 



x 



n 



The above inequalities follow by ||i?-DJ7^x|| 2 = ||x|| 2 and Lemma El This holds for both our 
chosen points x and y and in fact for all x ^ SI. Let ei = 3/125 and let 5 = l/(15e 8d ) (these choices 
will be explained shortly). Then, it follows from Lemma [6] that by setting a = \J 2d log (40nd) /n 
and our choices for k and q (see eqns. (|38p and (|37p below), each of the following three statements 
holds with probability at least 1 — 5: 

I Ai| < ei|| J ffDC/ yl (x + y)||2 = ei||x + y||2<4ei 
IA2I < e\ \\HDUax\\^ = t\ \\x\\ 2 < e\ 
|A 3 | < e l \\HDU A y\\l = e l \\y\\ 2 <e 1 . 
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Thus, combining the above with eqn. (|35p . for this single pair of vectors (x,y) E f2 x Q, 



\x T My\ = \x T UlDH T T T THDU A y - x T y\ < ^6ei = 3ei (36) 

holds with probability at least 1 — 35. Next, recall that there are no more than e 8d pairs of vectors 
(x, y) E fix f2, and we need eqn. ()36f) to hold for all of them. If we set 5 = l/(15e M ) then it follows 
by a union bound that eqn. (|36p holds for all pairs of vectors (x, y) E SI x f2 with probability at 
least .8. Additionally, let us set ei = 3/125, which implies that \x T My\ < 9/125 < (l - 2" 1 / 2 ) /4 
thus concluding the proof of the lemma. 

Finally, we discuss the values of the parameters q and k. Since 5 = l/(15e M ), e\ = 3/125, 
and a = 2d\og(4Qnd) Jn, the appropriate value for q from Lemma E] is (after some elementary 
manipulations) 

<7 > — 1 (2 log n + 16a + 13) . (37) 

n 

Similarly, 

k > C k (H8 2 (i + 85 2 ) . (38) 
Here C q and Ck are the unspecified constants of Lemma 



5.5 Satisfying condition (|HD 

In order to prove that condition ([9]) is satisfied, we start with Lemma [9j In words, this lemma 
states that given vectors x and y we can use the random sparse projection matrix T to approximate 
\x T y\ by |x T T T Ty|, provided that H^H^ (or HyH^, but not necessarily both) is bounded. The 
proof of this lemma is elementary but tedious and is deferred to Section 16.21 of the Appendix. 

Lemma 9 Let x,y be vectors in R n such that \\x\\ < a. Let T be the k x n sparse projection 
matrix of Section lo^E, with sparsity parameter q. If q > a 2 , then 



E 



I TrrTrri T I 2 

x 1 1 y — x y\ 



^ II II 2 II I ■ 2 . 1 || ii2 

<^lkll 2 Ily|l2 + ^llyll2 



The following lemma proves that condition @ is satisfied by Algorithm [21 The proof of this 



lemma depends on the bound provided by Lemma [9j Recall that b 1 - = UjrUj^b and thus 



uiufb 



z. 



Lemma 10 Assume Lemma\3\ holds. If k > 24d/e and q > 2n log(40n<i), then, with probability 
at least .75, 

2 

(THDUa) T THDb ± <eZ 2 /2. 



Proof: We first note that since U^b A 
all j E [d]. Thus, we have that 



0, it follows that ujf 6 J 



l/jp DH T HDb A 



UjDH T T T THDb A 



d 2 

(((HDU A ) U) f TT ™ Db± - U { A j)T DH T HDbA . 

7=1 ^ ' 



0, for 



(39) 



We now bound the expectation of the left hand side of eqn. (|39p by using Lemma [9] to bound 
each term on the right hand side of eqn. (|39p . Using eqn. (|24p of Lemma [3] we get that 



(HDUa)^ 



< y / 2n- 1 log(40nd) 
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holds for all j £ [d] . By our choice of the sparsity parameter q the conditions of Lemma [9] are 
satisfied. It follows from Lemma [9] that 



E 



UlDH T T T THDb A 



d 



(HDU A 



,U)\ 7 ^THDb 1 - - Uf 7 DH T HDb L 



3d 



A 



,(J) 



2 1 

2 fc 



A: 



\HDb A 



2 3d 



Z 2 . 



The last line follows since 



(HDU A 



that with probability at least .75, 



2 

1, for all j G [d]. Using Markov's inequality, we get 



UXDH T T T THDb 



2 12d 

< 



2 

The proof of the lemma is concluded by using the assumed value of k. 



5.6 Proving Theorem [3] 

By our choices of k and q as in eqns. (|32j) and (|31[). it follows that both conditions (|HJ) and ([9]) 
are satisfied. By a standard union bound, Lemmas (3j [H and [9] all hold with probability at least 
.5. Combining with Lemma [U we immediately get the accuracy guarantees of Theorem [3l 

In order to complete the proof we discuss the running time of Algorithm (2) First of all, by 
the construction of T, the expected number of non-zero entries in T is kqn. In Step 5 we need 
to compute the products THDA and THDb. Recall that A has d columns and thus the running 
time of computing both products is equal to the time needed to apply THD on (d + 1) vectors. 
First, note that in order to apply D on (d + 1) vectors in R n , n(d + 1) operations suffice. In 
order to estimate how many operations are needed to apply TH on (d + 1) vectors, we use the 
results of Theorem 2.1 (see also Section 7) of Ailon and Liberty [2], which state that at most 
2n(d + 1) log 2 (|T| + 1) operations are needed for this operation. Here |T| denotes the number 
of non-zero elements in the matrix T, which - in expectation - is nkq. After this preprocessing, 
Algorithm [2] must compute the pseudoinverse of a k x d matrix, or, equivalently, solve a least- 
squares problem on k constraints and d variables. This operation requires 0(kd 2 ) time since 
k > d. Thus, the entire algorithm runs in expected time 

n(d + 1) + 2n(d + 1)E [log 2 (\T\ + 1)] + O (kd 2 ) < n(d + 1) + 2n(d + 1) log 2 (nkq + 1) + O (kd 2 ) . 



References 

[1] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss 
transform. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing, 
pages 557-563, 2006. 

[2] N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH 
codes. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, 
pages 1-9, 2008. 



19 



[3] S. Arora, E. Hazan, and S. Kale. A fast random sampling algorithm for sparsifying matrices. 
In Proceedings of the 10th International Workshop on Randomization and Computation, 
pages 272-279, 2006. 

[4] H. Avron, P. Maymounkov, and S. Toledo. Blendenpik: Supercharging LAPACK's least- 
squares solver. Manuscript. (2009). 

[5] H. Avron, P. Maymounkov, and S. Toledo. Blendenpik: Supercharging LAPACK's least- 
squares solver. To appear in: SIAM Journal on Scientific Computing, 2010. 

[6] A. Ben-Israel and T.N.E. Greville. Generalized Inverses: Theory and Applications. Springer- 
Verlag, New York, 2003. 

[7] R. Bhatia. Matrix Analysis. Springer- Verlag, New York, 1997. 

[8] K.L. Clarkson and D.P. Woodruff. Numerical linear algebra in the streaming model. In 
Proceedings of the J^lst Annual ACM Symposium on Theory of Computing, pages 205-214, 
2009. 

[9] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. Journal 
of Symbolic Computation, 9(3):251-280, 1990. 

[10] G. Dahlquist, B. Sjoberg, and P. Svensson. Comparison of the method of averages with the 
method of least squares. Mathematics of Computation, 22(104):833-845, 1968. 

[11] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices I: 
Approximating matrix multiplication. SIAM Journal on Computing, 36:132-157, 2006. 

[12] P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Sampling algorithms for li regression 
and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete 
Algorithms, pages 1127-1136, 2006. 

[13] P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decomposi- 
tions. SIAM Journal on Matrix Analysis and Applications, 30:844-881, 2008. 

[14] P. Drineas, M.W. Mahoney, S. Muthukrishnan, and T. Sarlos. Faster least squares approxi- 
mation. Technical report. Preprint: arXiv:0710.1435 (2007). 

[15] U. Feige and E. Ofek. Spectral techniques applied to sparse random graphs. Random Struc- 
tures and Algorithms, 27(2):251-275, 2005. 

[16] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 
Baltimore, 1996. 

[17] O.H. Ibarra, S. Moran, and R. Hui. A generalization of the fast LUP matrix decomposition 
algorithm and applications. Journal of Algorithms, 3:45-56, 1982. 

[18] J. Matousek. On variants of the Johnson-Lindenstrauss lemma. Random Structures and 
Algorithms, 33(2): 142-156, 2008. 

[19] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, New 
York, 1995. 

[20] N.H. Nguyen, T.T. Do, and T.D. Tran. A fast and efficient algorithm for low-rank ap- 
proximation of a matrix. In Proceedings of the 41st Annual ACM Symposium on Theory of 
Computing, pages 215-224, 2009. 



20 



[21] 



V. Rokhlin and M. Tygert. A fast randomized algorithm for overdetermined linear least- 
squares regression. Proc. Natl. Acad. Sci. USA, 105(36):13212-13217, 2008. 



[22] M. Rudelson and R. Vershynin. Sampling from large matrices: an approach through geo- 
metric functional analysis. Journal of the ACM, 54(4):Article 21, 2007. 

[23] T. Sarlos. Improved approximation algorithms for large matrices via random projections. 
In Proceedings of the J^lth Annual IEEE Symposium on Foundations of Computer Science, 
pages 143-152, 2006. 

[24] G.W. Stewart and J.G. Sun. Matrix Perturbation Theory. Academic Press, New York, 1990. 

[25] S.M. Stigler. The History of Statistics: The Measurement of Uncertainty before 1900. Har- 
vard University Press, Cambridge, 1986. 

6 Appendix 

6.1 Approximating matrix multiplication 

Let A G W nxn be any matrix. Consider the following algorithm (which is essentially the algorithm 
in page 876 of [13]) that constructs a matrix C G M mxc consisting of c rescaled columns of A. 
We will seek a bound on the approximation error ||^4^4 T — CC T || 2 , which we will provide in 
Theorem |H A variant of this theorem appeared as Theorem 7 in [13J; this version modifies and 
supersedes eqn. (47) of Theorem 7 in the following manner: first, we will assume that the spectral 
norm of A is bounded and is at most one. This is a minor normalization assumption and we make 
it mainly to be compatible with Theorem 3.1 of [22]. Second, and most importantly, we will need 
to set c to be at least the value of eqn. for the theorem to hold. This second assumption was 
omitted from the statement of eqn. (47) in Theorem 7 of [13| . 



Data : A G R mxn , pi > 0,i e [n] s.t. Y2ie[n]Pi = -*-> positive integer c < n. 
Result : C G W mxc 

Initialize S G M. mxc to be an all-zero matrix, 
for t = 1, . . . ,c do 



Theorem 4 Let A G W mxn with \\A\\ 2 < 1. Construct C using the Exactly(c) algorithm and 
let the sampling probabilities pi satisfy 




end 

Return C = AS; 



Algorithm 3: The Exactly(c) algorithm. 




(40) 
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for all i E [n] for some constant j3 G (0,1]. Let e G (0,1) be an accuracy parameter, assume 
Cq \\A\\ 2 f > 4/3e 2 , and let 



c = 2 



4U\\ 

/3e 2 



log 



r 2 WAW 2 
(3e 2 



(Here cq is the unknown constant of Theorem 3.1, p. 8 of 122\/ .) Then, 

E[||^ T -CC T || 2 ] < e. 
Proof: Consider the EXACTLY(c) algorithm. Then 

n 

AA T = Y,A^A^ T . 

2 = 1 

Following the lines of [22j (p. 11) we shall view the matrix AA T as the true mean of a bounded 
operator valued random variable, whereas CC T = AS{AS) T = ASS T A T will be its empirical 
mean. Then, we will apply Theorem 3.1 of |22j . To this end, define a random vector y 6 M m as 



Pr 



1 



for i G [n]. The matrix C = AS has columns -^Ui, ^?/2> • • • , where yi,y2, ■ ■ ■ ,y c are c 

independent copies of y. Using this notation, it follows that 



and 



Finally, let 



E [yy T ] = AA T 
CC T = ASS T A T = l -j^ yt yl 



(41) 



t=i 



M= \\y\\ 



A 



(i)h 



(42) 



We can now apply Theorem 3.1, p. 8 of [22]. Notice that from eqn. (|4ip and our assumption on 
the spectral norm of A, we immediately get that 



|E [yy 1 



\AA T \\ 2 < \\A\\ 2 \\A T \\ 2 < 1. 



Then, clause (i) of Theorem 3.1 implies that 

E [||CC T -^ T || 2 ] < 



a, 



(43) 



where a = coy^^-M, for some unknown constant cq and M as in eqn. (I42p . However, in order 
for the above inequality to hold, it is necessary that a is at most one. In order to prove the 
theorem, we use the bound on the probabilities pi from eqn. (140p . Combining eqns. (142j) and 
p)j) we get 

1 



- v^" F 



Thus, 



' log c 

a = c \l M < c a 



logc I, AU 



(44) 
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and we need the last quantity in the above equation to be at most one. Let e G (0, 1) be the 
accuracy parameter and let c (the number of sampled columns) be equal to 




We now use the fact that for any rj > 4, if c > 2?? log 77 then > r/. Let r\ = c 2 , \\A\\^ / (/3e 2 ) 
and note that 77 > 4 from our assumption in the statement of the Theorem. Then, 



logc ^ /3e 2 

1 I /■' 

Combining eqns. (|45p and (|44p we get 



£ dtiF- <45) 



a < e < 1, (46) 

and thus the assumption of Theorem 3.1 on a is satisfied. Combining eqns. ()43|) and (|46p we get 
the statement of the theorem. 



6.2 The proof of Lemma P 

Let T G M fcxn be the sparse projection matrix constructed via Algorithm [2] (see Section f5 . 2 [) . with 
sparsity parameter q. In addition, given x,y € M n , let A = x T T T Ty — x T y. We will provide 
bound 

E [A 2 ] =E[(x T T T Ty-x T y) 2] 
Let be the z-th row of T, for i G [k], in which case 



A = E (°?*foH<)v - \ xT y) ■ 
i=i ^ ' 



Rather than computing E [A 2 ] directly, we will instead use that E [A 2 ] = (E [A]) 2 + Var [A]. 
We first claim that E [A] = 0. By linearity of expectation, 



e[a] = £; 



i=i 



E 



1 rp 



(47) 



We first analyze irq = t for some fixed i (w.l.o.g. i = 1). Let ti denote the z-th element of the 
vector t and recall that E \tj\ = 0, E [Utj] = for % ^ j, and also that E [i 2 ] = l/k. Thus, 



E [x T t T ty] = E 



x iUtjyj 

1=1 j=i 



n n n - 



i=l j=l 



By combining the above with eqn. ([4"7]) . it follows that E [A] = 0, and thus that E [A 2 ] = Var [A]. 
In order to provide a bound for Var [A] , note that 



Var [A] = Var 



i=l 
k 



rp rp ± rp 

x HoHw '- t x y 



Var x^t^y 



i=l 



(48) 
(49) 
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Eqn. (|48p follows since the k random variables x T tJ^t^y — \x T y are independent (since the 

elements of T are independent) and eqn. (I49h follows since \x T y is constant. In order to bound 
eqn. (|49p . we first analyze tu-\ = t for some i (w.l.o.g. i = 1). Then, 



Var [x T t T ty] = E [(x T t T ty) 2 ] - (E [x T t T £y])' 

1 



E[(x^) 2 ]-7i(^) 2 - 



(50) 



We will bound the E \[x T t T ty) 21 ^ term directly: 

2- 



E 



i=i j=i 



E 



n n n n 



y^ y^ y^ ^ia ^ii^i2%i%2 VoiVh 

11=1 12 = 1 31 = 1 32=1 



n n n n 



E E E E^ 1 ^ 2 ^^ 1 ^ 2 ^ 1 ^ 2 ]^ 1 ^ 2, 

n=i«2=iii=i j2=i 



Notice that if any of the four indices ii,i2,ji,j2 appears only once, then the expectation E [ti 1 ti 2 tj 1 tj 2 ] 
corresponding to those indices equals zero. This expectation is non-zero if the four indices are 
paired in couples or if all four are equal. That is, non-zero expectation happens if 



(A) 


h 


= k + h 


= h 


(re 2 — re terms) 


(B) 


k 


= h + k 


= 32 


(n 2 — n terms) 


(C) 


k 


= h + k 


= h 


(n 2 — n terms) 


(D) 


k 


= k = jl 


= 32 


(n terms). 



For case (A), let k = i-i = £ and let j\ = j'2 = p, in which case the corresponding terms in 
eqn. ([5T]) become: 



E E rfv[t 2 t 2 p } y 2 p = e E x ' E [*?] E K] v. 

l=\ p=l :p ^£ l=\ p=\:p^l 



n n 



k 2 E E 



2 2 
x lVp 



= 1 p=l:p^e 
n n 



^2 E E x * y p + ^2 E 4 

l=\ p=\;p^t p=l 



1 



1 

i 1 1 2 || m2 j_ ST~^ 2 2 

\ x \\2 \\y\\2 r.2 x pyp- 



p=l 



1 - 
12 E 



A: 2 



2 2 

x pUp 



P =i 



Similarly, cases (B) and (C) give: 



n n 

E E x ^ x p e [*f*p] y^p 


= ^yf 


1 

~ ^ 


n 

E 


£=1 p=l : p^£ 


p=i 




(where i\ 


= 31 


= £ 


ra n 

E E x ^ x pE [ifi 2 ] yajp 


= h {xTy)2 


1 

~ ¥ 


n 

E 


1 = 1 p=\;p£t 


p=i 




(where i\ 


= 32 


= 1 



Otlp 
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Finally, for case (D), let i\ = i% = j\ = j% = £, in which case: 

n \ n 

where we have used that E [i^] = l/(k 2 q). By combining these four terms for each of the k terms 
in the sum, it follows from eqns. (|49p and (|50p that 

y p=i y p=i 

^ II 1 1 2 || m2 . 1 2 2 /r \ 

< ^IfII 2 Wyh + fa 1^ x pyp- ( 52 ) 
P =i 



In the above we used (x T y) 2 < \\x\\ 2 \\y\\ 2 - Since we assumed that ||a;|| ^ < a, the second term 

"ill II 2 
k q \\y\\2 

assumed that q > a 2 . 



on the right hand side of eqn. (I52h is bounded by j- H2/H2 and the lemma follows since we have 
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